Evaluating the Factors That Affect Red Wine Quality

========================================================

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

First, take an overall look at each variable and how its values are distributed.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

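The plots show that chloride content is concentrated in a narrow range; that is, it varies little across the 1,599 records. Because the goal of the analysis is to find the factors that affect red wine quality, we first look at how quality is distributed in the data.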
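The per-variable histograms described at the start of this section could be drawn as a single faceted plot. The sketch below is one possible way to do it, not necessarily how the original plots were made. It assumes `ggplot2` is installed and uses only the six `head()` rows shown above (four columns) so that it runs on its own; the names `wine_head` and `long` are illustrative.

```r
library(ggplot2)

# First six rows from the head() output above, limited to four columns.
wine_head <- data.frame(
  fixed.acidity  = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  chlorides      = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
)

# stack() reshapes to long format:
# `values` holds the numbers, `ind` holds the source column name.
long <- stack(wine_head)

# One histogram per variable; free scales because the units differ.
# Setting `bins` explicitly avoids the repeated stat_bin() message.
p <- ggplot(long, aes(x = values)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ ind, scales = "free")
print(p)
```

With the full data frame, `stack(red_wine)` would produce one panel per column in the same way.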

Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

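Across the 1,599 records, quality scores range from 3 to 8. There are no wines with a very high score (10) or a very low score (1), which is somewhat unfavorable for the analysis.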


## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

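Next, look at how many wines fall at each residual sugar level. The histogram shows that residual sugar has a long right tail, so dropping the sparsely populated tail makes the trend and distribution easier to see.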

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.274   2.500   5.000

The plot shows that most wines have a residual sugar level between 1.5 and 2.5 g/dm^3.
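The tail trimming described above can be sketched in base R. The code that produced the trimmed summary is not shown, so the cutoff of 5 g/dm^3 is an assumption inferred from the trimmed maximum of 5.000, and `trim_upper()` is a hypothetical helper. The demo vector holds the six `head()` values plus the reported maximum of 15.5.

```r
# Values taken from the output above: six head() rows plus the reported maximum.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 15.5)

# Hypothetical helper: keep only values at or below the cutoff.
trim_upper <- function(x, cutoff) x[x <= cutoff]

# Assumed cutoff of 5 g/dm^3 (inferred from the trimmed summary, not stated).
trimmed <- trim_upper(residual_sugar, cutoff = 5)
summary(trimmed)
```

On the full data the same call would be `trim_upper(red_wine$residual.sugar, cutoff = 5)`.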

ggplot(red_wine, aes(alcohol)) +
  geom_histogram(binwidth = 0.1) +
  geom_vline(xintercept = median(red_wine$alcohol), color = 'royalblue') +
  geom_vline(xintercept = mean(red_wine$alcohol), color = 'coral')

summary(red_wine$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Because the chloride values are tightly concentrated, we restrict the plot to a narrower range.

ggplot(red_wine, aes(x = chlorides)) +
  geom_histogram() +
  xlim(quantile(red_wine$chlorides, 0.05), quantile(red_wine$chlorides, 0.95)) +
  xlab("chlorides (middle 90%)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 158 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

summary(subset(red_wine$chlorides,
               red_wine$chlorides < quantile(red_wine$chlorides, 0.95)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07800 0.07914 0.08800 0.12600

ggplot(red_wine, aes(x = density)) +
  geom_density() +
  stat_function(linetype = 'dashed',
                color = 'royalblue',
                fun = dnorm,
                args = list(mean = mean(red_wine$density), sd = sd(red_wine$density)))

Univariate Analysis

What is the structure of your dataset?

The dataset contains 1,599 records, each with 12 variables.

What is/are the main feature(s) of interest in your dataset?

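The main feature of interest is quality: which factors drive changes in red wine quality? Scores in the data only range from 3 to 8, so there are no exceptionally good or exceptionally bad wines. The mean score is 5.636.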

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

So far only single variables have been explored, so it is not yet possible to tell which factors affect red wine quality.

Did you create any new variables from existing variables in the dataset?

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Bivariate Plots Section

To understand how the variables relate to one another, we first list the pairwise correlations.

round(cor(red_wine), 3)
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                1.000           -0.256       0.672
## volatile.acidity            -0.256            1.000      -0.552
## citric.acid                  0.672           -0.552       1.000
## residual.sugar               0.115            0.002       0.144
## chlorides                    0.094            0.061       0.204
## free.sulfur.dioxide         -0.154           -0.011      -0.061
## total.sulfur.dioxide        -0.113            0.076       0.036
## density                      0.668            0.022       0.365
## pH                          -0.683            0.235      -0.542
## sulphates                    0.183           -0.261       0.313
## alcohol                     -0.062           -0.202       0.110
## quality                      0.124           -0.391       0.226
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                 0.115     0.094              -0.154
## volatile.acidity              0.002     0.061              -0.011
## citric.acid                   0.144     0.204              -0.061
## residual.sugar                1.000     0.056               0.187
## chlorides                     0.056     1.000               0.006
## free.sulfur.dioxide           0.187     0.006               1.000
## total.sulfur.dioxide          0.203     0.047               0.668
## density                       0.355     0.201              -0.022
## pH                           -0.086    -0.265               0.070
## sulphates                     0.006     0.371               0.052
## alcohol                       0.042    -0.221              -0.069
## quality                       0.014    -0.129              -0.051
##                      total.sulfur.dioxide density     pH sulphates alcohol
## fixed.acidity                      -0.113   0.668 -0.683     0.183  -0.062
## volatile.acidity                    0.076   0.022  0.235    -0.261  -0.202
## citric.acid                         0.036   0.365 -0.542     0.313   0.110
## residual.sugar                      0.203   0.355 -0.086     0.006   0.042
## chlorides                           0.047   0.201 -0.265     0.371  -0.221
## free.sulfur.dioxide                 0.668  -0.022  0.070     0.052  -0.069
## total.sulfur.dioxide                1.000   0.071 -0.066     0.043  -0.206
## density                             0.071   1.000 -0.342     0.149  -0.496
## pH                                 -0.066  -0.342  1.000    -0.197   0.206
## sulphates                           0.043   0.149 -0.197     1.000   0.094
## alcohol                            -0.206  -0.496  0.206     0.094   1.000
## quality                            -0.185  -0.175 -0.058     0.251   0.476
##                      quality
## fixed.acidity          0.124
## volatile.acidity      -0.391
## citric.acid            0.226
## residual.sugar         0.014
## chlorides             -0.129
## free.sulfur.dioxide   -0.051
## total.sulfur.dioxide  -0.185
## density               -0.175
## pH                    -0.058
## sulphates              0.251
## alcohol                0.476
## quality                1.000

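Pairs with a correlation above roughly 0.5 or below roughly -0.5 get the most attention. Positive correlations of interest include density and fixed acidity (0.668), and quality and alcohol (0.476). Alcohol and density are negatively correlated (-0.496).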
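Picking out the strongly correlated pairs by eye is error-prone, so a small filter can help. The following is a minimal sketch: `strong_pairs()` is a hypothetical helper, and the matrix is a five-variable subset copied from the `cor()` output above so that the example runs without the data file.

```r
# Subset of the correlation table above (values copied from the cor() output).
vars <- c("fixed.acidity", "density", "pH", "alcohol", "quality")
cors <- matrix(c(
   1.000,  0.668, -0.683, -0.062,  0.124,
   0.668,  1.000, -0.342, -0.496, -0.175,
  -0.683, -0.342,  1.000,  0.206, -0.058,
  -0.062, -0.496,  0.206,  1.000,  0.476,
   0.124, -0.175, -0.058,  0.476,  1.000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Hypothetical helper: list each pair once (upper triangle only)
# whose absolute correlation exceeds the threshold.
strong_pairs <- function(m, threshold = 0.5) {
  idx <- which(upper.tri(m) & abs(m) > threshold, arr.ind = TRUE)
  data.frame(
    var1 = rownames(m)[idx[, "row"]],
    var2 = colnames(m)[idx[, "col"]],
    r    = m[idx],
    stringsAsFactors = FALSE
  )
}

strong <- strong_pairs(cors)
print(strong)
```

On the full data, `strong_pairs(cor(red_wine))` would apply the same filter to all 12 variables, provided every column is still numeric at that point.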

We focus first on the relationship between quality and alcohol.

ggplot(red_wine, aes(x = alcohol, y = quality)) +
  geom_point()

Many of the points overlap, however, so we add jitter to the plot to make the trend easier to see.
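To get a feel for how much overplotting there is, one option is to count how many records share the same (alcohol, quality) pair. This is only a sketch using the six `head()` rows shown earlier; `head_rows` and `counts` are illustrative names.

```r
# alcohol and quality from the six head() rows shown at the top of the report.
head_rows <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5, 5, 5, 6, 5, 5)
)

# Count records per (alcohol, quality) pair; any n > 1 means
# several points are drawn on top of each other in a plain scatterplot.
counts <- aggregate(n ~ alcohol + quality,
                    data = transform(head_rows, n = 1),
                    FUN = sum)
print(counts)
```

Even in these six rows, three records land on the same point, which is why jitter and transparency help.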

ggplot(red_wine, aes(x = alcohol, y = quality)) +
  geom_jitter(alpha = 0.25) +
  geom_smooth(method = "lm") +
  labs(title = 'Relation between alcohol and quality')

The plot shows a moderate positive correlation between quality and alcohol (r = 0.476).

red_wine$quality <- factor(red_wine$quality)

ggplot(red_wine, aes(y = alcohol, x = quality)) +
  geom_boxplot(alpha = 0.1, color = 'blue') +
  # `fun` replaces `fun.y`, which is deprecated since ggplot2 3.3.0
  stat_summary(fun = mean, geom = 'point', color = 'red') +
  geom_jitter(alpha = 0.1) +
  labs(x = 'quality',
       y = 'alcohol',
       title = 'Boxplot of alcohol by quality')

red_wine$quality <- factor(red_wine$quality)

ggplot(red_wine, aes(y = volatile.acidity, x = quality)) +
  geom_boxplot(alpha = 0.1, color = 'blue') +
  # `fun` replaces `fun.y`, which is deprecated since ggplot2 3.3.0
  stat_summary(fun = mean, geom = 'point', color = 'red') +
  geom_jitter(alpha = 0.1) +
  labs(x = 'quality',
       y = 'volatile.acidity',
       title = 'Boxplot of volatile.acidity by quality')

ggplot(red_wine, aes(x = fixed.acidity, y = pH)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = 'lm') +
  labs(x = 'fixed.acidity',
       y = 'pH',
       title = 'Relation between pH and fixed.acidity')

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

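Of all the variables, alcohol has the highest correlation with the quality score (0.476). Quality also has a fairly strong negative correlation with volatile acidity (-0.391): the higher the volatile acidity, the lower the quality tends to be.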

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

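Density and alcohol have a fairly strong negative correlation (-0.496). This surprised me, probably because I know very little about what red wine is made of.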

What was the strongest relationship you found?

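The negative correlation between fixed acidity and pH (-0.683). This one is easy to guess: the higher the acidity, the lower the pH.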

Multivariate Plots Section

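For the multivariate plots, quality is treated as a categorical variable and mapped to color. We start with two variables that were correlated with quality in the bivariate analysis: alcohol and density.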

red_wine$quality <- factor(red_wine$quality)

ggplot(red_wine, aes(x = density, y = alcohol, color = quality)) +
  geom_jitter() +
  scale_color_brewer(type = 'div', palette = "PuOr") +
  coord_cartesian(xlim = c(0.985, 1.0)) +
  labs(title = 'Scatterplot of density and alcohol, colored by quality level')

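The plot shows that higher-scoring wines tend to have higher alcohol content. Given the negative correlations in the table above (alcohol vs. density: -0.496; quality vs. density: -0.175), they also tend to have lower, not higher, density.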

red_wine$quality <- factor(red_wine$quality)

ggplot(red_wine, aes(x = alcohol, color = quality, y = citric.acid)) +
  geom_jitter() +
  scale_color_brewer(type = 'div', palette = 'PuOr') +
  labs(title = 'Scatterplot of citric.acid and alcohol, colored by quality level')

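The plot shows that higher-scoring wines tend to have both higher alcohol and higher citric acid. Volatile acidity has a fairly strong negative correlation with quality, and alcohol has a fairly strong positive correlation with quality, so the next plot looks at these three variables together.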

ggplot(red_wine, aes(x = alcohol, color = quality, y = volatile.acidity)) +
  geom_jitter() +
  scale_color_brewer(type = 'div', palette = 'PuOr') +
  labs(title = 'Scatterplot of volatile.acidity and alcohol, colored by quality level')

The dark-colored points are fairly concentrated: when alcohol is high and volatile acidity is low, quality is generally somewhat higher.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The plots indicate that quality rises as alcohol increases and volatile acidity decreases.
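A numeric companion to the colored scatterplots is the mean of each variable per quality level. The sketch below uses only the six `head()` rows shown at the top of the report, so it illustrates the mechanics rather than the result; `wine_head` and `by_quality` are illustrative names.

```r
# Three columns from the six head() rows shown at the top of the report.
wine_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Mean alcohol and mean volatile acidity for each quality level.
by_quality <- aggregate(cbind(alcohol, volatile.acidity) ~ quality,
                        data = wine_head, FUN = mean)
print(by_quality)
```

Six rows are far too few to support any conclusion; running the same `aggregate()` call on the full `red_wine` data frame would give the per-level means that the plots visualize.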

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

ggplot(red_wine, aes(alcohol)) +
  geom_histogram(binwidth = 0.1) +
  geom_vline(xintercept = median(red_wine$alcohol), color = 'royalblue') +
  annotate('text',
           x = median(red_wine$alcohol) - 0.35,
           y = 120,
           label = paste('median\n(', median(red_wine$alcohol), ')', sep = ''),
           color = 'royalblue') +
  geom_vline(xintercept = mean(red_wine$alcohol), color = 'red') +
  annotate('text',
           x = mean(red_wine$alcohol) + 0.35,
           y = 120,
           label = paste('mean\n(', round(mean(red_wine$alcohol), 2), ')', sep = ''),
           color = 'red') +
  xlab("Alcohol (%)") +
  ylab("Count") +
  labs(title = "Histogram of alcohol")

Description One

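Because alcohol is correlated with quality, this plot shows how many wines fall at each alcohol level. The mean (10.42) is greater than the median (10.20), so the distribution is skewed to the right.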

Plot Two

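# quality was converted to a factor earlier; convert it back to numeric
# so that the linear fit in geom_smooth() can be computed.
ggplot(red_wine, aes(x = alcohol, y = as.numeric(as.character(quality)))) +
  geom_jitter(alpha = 0.1, height = 0.48, width = 0.025) +
  geom_smooth(method = "lm") +
  ggtitle("Quality vs Alcohol Content") +
  xlab("Alcohol (%)") +
  ylab("Quality (0-10)")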

The bivariate correlation analysis shows a clear positive correlation between alcohol and quality (r = 0.476).

Description Two

This plot shows the correlation between the two variables, alcohol and quality.

Plot Three

ggplot(red_wine, aes(x = alcohol, color = quality, y = volatile.acidity)) +
  geom_jitter() +
  scale_color_brewer(type = 'div', palette = 'PuOr') +
  ggtitle("Quality by Volatile Acidity and Alcohol") +
  xlab("Alcohol (%)") +
  ylab("Volatile Acidity (g/L)")

Description Three

This plot shows that as quality increases, alcohol content rises while volatile acidity falls.


Reflection

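  1. First, I examined the overall structure of the data: 1,599 records, each with 12 variables related to wine quality.
  2. The goal is to find which components relate to red wine quality, but the sample contains no high-quality wines (scored above 8), which may affect the overall results.
  3. The pairwise correlations pointed to alcohol as having some correlation with red wine quality.
  4. With one variable identified (alcohol), I looked for a second factor that affects quality and found that volatile acidity is negatively correlated with red wine quality.
  5. The analysis so far is only preliminary evidence that high-quality red wines share some traits, such as high alcohol, high citric acid, and lower volatile acidity. It does not show that these factors are sufficient conditions for a good red wine.
  6. The approach was to fix one strongly correlated factor and then add a second factor around it, but the high-quality data points are still fairly scattered. More data would help.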